On Shanks' Algorithm for Modular Square Roots 



Abstract 

Let $p$ be a prime number, $p = 2^n q + 1$, where $q$ is odd. D. Shanks
described an algorithm to compute square roots (mod $p$) which needs
$O(\log q + n^2)$ modular multiplications. In this note we describe two
modifications of this algorithm. The first needs only $O(\log q + n^{3/2})$
modular multiplications, while the second is a parallel algorithm which
needs $n$ processors and takes $O(\log q + n)$ time.

In [5], D. Shanks gave an efficient algorithm for computing square roots
modulo a prime. If $p = 2^n q + 1$, this algorithm consists of an
initialization, which takes $O(\log q)$ modular multiplications, and a loop,
which is performed at most $n$ times and needs at most $n$ modular
multiplications each time. Hence the total cost is $O(\log q + n^2)$ modular
multiplications. This is actually the normal running time, for S. Lindhurst [1]
has shown that on average the loop needs
$\frac{1}{4}(n^2 + 7n - 12) + 1/2^{n-1}$ modular multiplications. For most
prime numbers $p$, $n$ is much smaller than $\sqrt{\log q}$, hence the
initialization will be the most costly part. However, prime numbers occurring
"in practice" are not necessarily random, and if $p - 1$ is divisible by a
large power of 2, the loop becomes more expensive than the initialization.
In this note we will give two modifications of Shanks' algorithm. The first
algorithm needs only $O(\log q + n^{3/2})$ modular multiplications, while the
second is a parallel algorithm running on $n$ processors which needs
$O(\log q + n)$ time. On the other hand, both our algorithms have larger space
requirements. Whereas Shanks' algorithm has to store only a bounded number of
residues (mod $p$), our algorithms have to create two arrays, each containing
$n$ residues (mod $p$). However, on current hardware this amount of memory
appears easily manageable compared to the cost of the computation.

We assume that looking up an element in a table of length n is at most as 
expensive as a modular multiplication, an assumption which is certainly satisfied 
on any reasonable computer. 

First we give a description of Shanks' algorithm. We assume that we are
given a prime $p = 2^n q + 1$, a quadratic residue $a$ and a nonresidue $u$,
and are to compute an $x$ such that $x^2 \equiv a$ (mod $p$). Then the
algorithm runs as follows.
Algorithm 1: 



1. Set $k = n$, $z = u^q$, $x = a^{(q+1)/2}$, $b = a^q$.

2. Let $m$ be the least integer with $b^{2^m} \equiv 1$ (mod $p$).

3. Set $t = z^{2^{k-m-1}}$, $z = t^2$, $b = bz$, $x = xt$.

4. If $b = 1$, stop and return $x$, otherwise set $k = m$ and go to step 2.
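
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the author's code: the names `split_two_power` and `shanks_sqrt` are hypothetical, the inputs are assumed to be valid (an odd prime `p`, a nonzero quadratic residue `a`, a quadratic nonresidue `u`), and the test for `b = 1` is moved to the top of the loop so that the case `b = 1` after step 1 returns immediately.

```python
def split_two_power(p):
    """Write p - 1 = 2**n * q with q odd and return (n, q)."""
    n, q = 0, p - 1
    while q % 2 == 0:
        n += 1
        q //= 2
    return n, q


def shanks_sqrt(a, u, p):
    """Return x with x*x = a (mod p); a is a residue, u a nonresidue (assumed)."""
    n, q = split_two_power(p)
    # Step 1: initialization, O(log q) multiplications.
    k = n
    z = pow(u, q, p)
    x = pow(a, (q + 1) // 2, p)
    b = pow(a, q, p)
    while b != 1:
        # Step 2: least m with b^(2^m) = 1, found by repeated squaring.
        m, s = 0, b
        while s != 1:
            s = s * s % p
            m += 1
        # Step 3: t = z^(2^(k-m-1)) has order 2^(m+1); the new z has order 2^m.
        t = pow(z, 1 << (k - m - 1), p)
        z = t * t % p
        b = b * z % p
        x = x * t % p
        # Step 4: continue with k = m.
        k = m
    return x
```

For instance, with $p = 41 = 2^3 \cdot 5 + 1$, $u = 3$ and $a = 2$, the call `shanks_sqrt(2, 3, 41)` returns 17, and $17^2 = 289 \equiv 2$ (mod 41).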

It is easy to see that the congruence $x^2 \equiv ab$ (mod $p$) holds at every
stage of the algorithm; hence, if it terminates, we really obtain a square root
of $a$.

To show that this algorithm terminates after at most $n$ loops, consider the
orders of $b$ and $z$ (mod $p$). After the first step, the latter is
$2^n = 2^k$, since $u$ is a nonresidue, whereas the former is strictly smaller,
since $a$ is a quadratic residue. In the second step the order of $b$ is
determined to be exactly $2^m$, and in the third step $z$ is replaced by some
power such that the new value of $z$ has order exactly $2^m$, too. Then $b$ is
replaced by $bz$, thus the order of the new value of $b$ is at most $2^{m-1}$.
Setting $k = m$, we get the same situation as before: the order of $z$ is
exactly $2^k$, and the order of $b$ is less. Hence every time the loop is
executed, the order of $b$ is reduced, while it always remains a power of 2.
Hence after at most $n$ loops, the order of $b$ has to be 1, i.e. $b \equiv 1$
(mod $p$).
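
As an illustration of the invariant and of the shrinking orders, here is a small worked trace. The numbers $p = 17$, $u = 3$, $a = 13$ are chosen here for the example and do not come from the paper; all congruences are modulo 17.

```latex
\begin{align*}
p &= 17 = 2^4 \cdot 1 + 1, \qquad n = 4,\ q = 1,\ u = 3,\ a = 13 \\
\text{step 1:}\quad & k = 4,\ z = 3,\ x = 13,\ b = 13,
  & x^2 &\equiv 16 \equiv ab, & \operatorname{ord}(z) &= 2^4,\ \operatorname{ord}(b) = 2^2 \\
\text{loop 1:}\quad & m = 2,\ t = z^{2} = 9,\ z = 13,\ b = 16,\ x = 15,
  & x^2 &\equiv 4 \equiv ab, & \operatorname{ord}(z) &= 2^2,\ \operatorname{ord}(b) = 2^1 \\
\text{loop 2:}\quad & m = 1,\ t = z = 13,\ z = 16,\ b = 1,\ x = 8,
  & x^2 &\equiv 13 \equiv ab, & \operatorname{ord}(z) &= 2^1,\ \operatorname{ord}(b) = 2^0
\end{align*}
```

The trace ends with $x = 8$, and indeed $8^2 = 64 \equiv 13$ (mod 17).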

The next algorithm is our first modification of Algorithm 1. 
Algorithm 2: 

1. Set $k = n$, $z = u^q$, $x = a^{(q+1)/2}$, $b = a^q$.

2. Compute $z^2, z^{2^2}, z^{2^3}, \ldots, z^{2^n}$ and store these values in an array.

3. Compute $b^2, b^{2^2}, b^{2^3}, \ldots, b^{2^n}$ and store these values in an array.

4. Set $i = 1$, $b_0 = b$, $z_0 = z$.

5. Let $m$ be the least integer such that $b_0^{2^m} z_1^{2^m} \cdots z_i^{2^m} \equiv 1$ (mod $p$).

6. Set $t = z_i^{2^{k-m-1}}$, $z_{i+1} = t^2$, $b = b z_{i+1}$, $x = xt$, $i = i + 1$, $k = m$.

7. If $b = 1$, stop and return $x$.

8. If $i < \sqrt{n}$, continue with 5, otherwise set $z = z_{i+1}$ and continue with 3.

First observe that there are no essential changes to the algorithm. The
only difference is that in step 5, which corresponds to step 2 in the original
algorithm, no explicit reference to $b$ is made; instead $b$ is replaced by
$b_0 z_1 \cdots z_i$. Of course, the numerical values of these expressions are
the same. However, we claim that in the form above the algorithm needs only
$O(\log q + n^{3/2})$ modular multiplications.

Note first that for any $i$ at any stage in the algorithm, $z_i = u^{2^l q}$
for some integer $l$, that is, $z_i$ is one of the powers of the initial value
$z = u^q$ computed in step 2, and the same is true for $t$. In fact, the only
place where operations are performed on these numbers is step 6, where a
certain number of squarings are performed; the effect of this operation is just
a shift within the array of precomputed values. Hence, for any exponent $m$ and
index $i$, $z_i^{2^m}$ can be obtained by a look-up in the array generated in
step 2. After this remark we can compute the running time. The inner loop is
performed at most $n$ times, hence step 6 needs $O(n)$ modular multiplications
altogether. The outer loop is performed at most $\lceil \sqrt{n} \rceil$ times,
hence step 3 requires at most $n^{3/2}$ modular multiplications altogether.
Step 2 requires $n$ multiplications and is performed once, and steps 1, 4, 7
and 8 can be neglected.

Hence we have to consider step 5. The check whether for a given $m'$ the
congruence $b_0^{2^{m'}} z_1^{2^{m'}} \cdots z_i^{2^{m'}} \equiv 1$ (mod $p$)
holds can be done using $i$ modular multiplications, since all the powers can
be obtained by look-ups in the arrays generated in steps 2 and 3. We already
know at this stage that the congruence holds for $m' = k$, hence we compute the
product for $m' = k - 1, k - 2, \ldots$, until we find a value for which the
product is not 1 (mod $p$). Doing so we have to check $k - m$ values $m'$,
hence at a given stage this needs $(k - m) i = O((k - m)\sqrt{n})$ modular
multiplications. To estimate the sum of these costs, introduce a counter
$\nu$, which is initialized to be 0 in step 1 and raised by one in step 5, that
is, $\nu$ counts the number of times the inner loop is executed. Define a
sequence $(m_\nu)$, where $m_\nu$ is the value of $m$ as found in step 5 when
the counter has the value $\nu$. With this notation the costs of step 5 as
estimated above are $O((m_{\nu-1} - m_\nu)\sqrt{n})$, and the sum over $\nu$
telescopes. Since $m_1 < n$ and $m_{\nu_1} = 1$, where $\nu_1$ is the value of
the counter $\nu$ when the algorithm terminates, the total cost of step 5 is
$O(n^{3/2})$.
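
Written out, the telescoping sum might look as follows. This is a sketch under one assumption not stated in the text: the sum is taken from $\nu = 2$, so that $m_{\nu-1}$ is defined for every term, and the first pass is bounded separately by $(n - m_1)\sqrt{n} \le n^{3/2}$.

```latex
\sum_{\nu=2}^{\nu_1} (m_{\nu-1} - m_\nu)\sqrt{n}
  \;=\; (m_1 - m_{\nu_1})\sqrt{n}
  \;<\; n \cdot \sqrt{n}
  \;=\; n^{3/2}.
```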

Putting the estimates together we see that, apart from the $O(\log q)$
multiplications of the initialization in step 1, there is a total of
$O(n^{3/2})$ modular multiplications. In the same way one sees that we need
$O(n^{3/2})$ look-ups, and by our assumption on the cost of the latter
operation we conclude that the running time of Algorithm 2 is indeed
$O(\log q + n^{3/2})$.
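
The table-based search of Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the author's implementation: the name `shanks_sqrt_tables` and the variables `Z`, `B`, `shifts` and `block` are hypothetical, inputs are assumed valid (odd prime `p`, nonzero residue `a`, nonresidue `u`), the list of factors $z_j$ starts empty after each rebuild of the $b$-table, and the block length is taken to be $\lfloor \sqrt{n} \rfloor$. Each $z_j$ is stored only as an index into the array of step 2, so that $z_j^{2^{m'}}$ is a look-up.

```python
from math import isqrt


def shanks_sqrt_tables(a, u, p):
    """Sketch of Algorithm 2: square root of the residue a modulo p."""
    n, q = 0, p - 1
    while q % 2 == 0:              # p - 1 = 2^n * q with q odd
        n, q = n + 1, q // 2
    x = pow(a, (q + 1) // 2, p)    # step 1
    b = pow(a, q, p)
    k = n
    Z = [pow(u, q, p)]             # step 2: Z[j] = z^(2^j), of order 2^(n-j)
    for _ in range(n):
        Z.append(Z[-1] * Z[-1] % p)
    block = max(1, isqrt(n))       # rebuild the b-table every ~sqrt(n) rounds
    while b != 1:
        B = [b]                    # step 3: B[j] = b0^(2^j) for j = 0..k
        for _ in range(k):
            B.append(B[-1] * B[-1] % p)
        shifts = []                # z_j = Z[shifts[j]]; b = b0 * product of z_j
        while b != 1 and len(shifts) < block:
            # Step 5: test m' = k-1, k-2, ... with look-ups and products only.
            m = k
            while True:
                prod = B[m - 1]
                for s in shifts:
                    prod = prod * Z[s + m - 1] % p   # z_j^(2^(m-1))
                if prod != 1:      # b^(2^(m-1)) != 1, so b has order 2^m
                    break
                m -= 1
            # Step 6: t = Z[n-m-1] has order 2^(m+1); the new z is Z[n-m].
            x = x * Z[n - m - 1] % p
            b = b * Z[n - m] % p
            shifts.append(n - m)
            k = m
    return x
```

The inner `for` loop is where the factor $i \le \sqrt{n}$ in the cost estimate comes from: each test of a candidate $m'$ costs one multiplication per stored factor.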

Finally we describe a parallel version of Algorithm 1: 

Algorithm 3: 

1. Set $k = n$, $z = u^q$, $x = a^{(q+1)/2}$, $b = a^q$.

2. Compute $z^2, z^{2^2}, z^{2^3}, \ldots, z^{2^n}$ and store these values in an array.

3. Compute $b^2, b^{2^2}, b^{2^3}, \ldots, b^{2^n}$ and store these values in an array.

4. Let $m$ be the least integer such that $b^{2^m} \equiv 1$ (mod $p$).

5. Set $t = z^{2^{k-m-1}}$, $z = t^2$, $x = xt$, $k = m$.

6. Set $b = bz$, compute $b^2, b^{2^2}, \ldots, b^{2^m}$ and replace the powers of $b$ by these
new values.

7. If $b = 1$, stop and return $x$, otherwise continue with step 4.

It is clear that this algorithm is equivalent to Algorithm 1; furthermore all
steps with the exception of step 6 can be performed by a single processor in
time $O(\log q + n)$. Now consider step 6. This step has to be executed at most
$n$ times, and we claim that it can be done by $n$ processors in a single step.
Indeed, since all relevant powers both of the old value of $b$ and of $z$ are
stored, each of the powers of the new value of $b$ can be obtained by a single
multiplication, and all these multiplications can be done independently of each
other on different processors. Hence, Algorithm 3 runs in time
$O(\log q + n)$ on $n$ processors.
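
A sequential simulation of Algorithm 3 is sketched below. It is only an illustration: the name `shanks_sqrt_parallel_sketch` is hypothetical, inputs are assumed valid (odd prime `p`, nonzero residue `a`, nonresidue `u`), and the list comprehension in step 6 stands in for the $n$ processors, since each entry of the new table depends only on one old entry of `B` and one entry of `Z`.

```python
def shanks_sqrt_parallel_sketch(a, u, p):
    """Sequential simulation of Algorithm 3 for the residue a modulo p."""
    n, q = 0, p - 1
    while q % 2 == 0:              # p - 1 = 2^n * q with q odd
        n, q = n + 1, q // 2
    x = pow(a, (q + 1) // 2, p)    # step 1
    Z = [pow(u, q, p)]             # step 2: Z[j] = z^(2^j) for the initial z
    for _ in range(n):
        Z.append(Z[-1] * Z[-1] % p)
    B = [pow(a, q, p)]             # step 3: B[j] = b^(2^j)
    for _ in range(n):
        B.append(B[-1] * B[-1] % p)
    k = n
    while B[0] != 1:               # step 7: B[0] is the current b
        # Step 4: least m with b^(2^m) = 1, read off the table from the top.
        m = k
        while B[m - 1] == 1:
            m -= 1
        # Step 5: t = Z[n-m-1] has order 2^(m+1); the new z is Z[n-m].
        x = x * Z[n - m - 1] % p
        k = m
        # Step 6: (bz)^(2^j) = b^(2^j) * z^(2^j); the m+1 products are
        # independent of each other, one per (simulated) processor.
        B = [B[j] * Z[n - m + j] % p for j in range(m + 1)]
    return x
```

On real parallel hardware the comprehension would be replaced by a data-parallel multiply over the two arrays; the scan in step 4 moves only downwards, so it costs $O(n)$ look-ups over the whole run.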